Abstract

In this paper we consider learning in the passive setting, but with a slight modification.
We assume that the target expected loss, also referred to as target risk, is provided
in advance to the learner as prior knowledge. Unlike most studies in learning theory,
which only incorporate prior knowledge into the generalization bounds, we are able to
explicitly utilize the target risk in the learning process. Our analysis reveals a surprising
result on the sample complexity of learning: by exploiting the target risk in the learning
algorithm, we show that when the loss function is both strongly convex and smooth,
the sample complexity reduces to $O(\log(1/\epsilon))$, an exponential improvement compared to
the sample complexity $O(1/\epsilon)$ for learning with strongly convex loss functions. Furthermore,
our proof is constructive and is based on a computationally efficient stochastic
optimization algorithm for such settings, which demonstrates that the proposed algorithm
is practically useful.

1 Introduction 

In the standard passive supervised learning setting, the learning algorithm is given a set
of labeled examples $S = ((x_1, y_1), \ldots, (x_n, y_n))$ drawn i.i.d. from a fixed but unknown
distribution $\mathcal{D}$. The goal, with the help of labeled examples, is to output a classifier $h$ from
a predefined hypothesis class $\mathcal{H}$ that does well on unseen examples coming from the same
distribution. The sample complexity of an algorithm is the number of examples that is
sufficient to ensure that, with probability at least $1 - \delta$ (w.r.t. the random choice of $S$), the
algorithm picks a hypothesis with an error that is at most $\epsilon$ from the optimal one. The sample
complexity of passive learning is well established and goes back to early works in learning
theory, where the lower bounds $\Omega\left(\frac{1}{\epsilon}\left(\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right)$ and $\Omega\left(\frac{1}{\epsilon^2}\left(\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right)$ were obtained
in classic PAC and general agnostic PAC settings, respectively [HIISIII].

In light of the no free lunch theorem, learning is impossible unless we make assumptions
regarding the nature of the problem at hand. Therefore, when approaching a particular learning
problem, it is desirable to take into account some prior knowledge we might have about the
problem and to use a specialized algorithm that exploits this knowledge in the learning process
or in the theoretical analysis. A key issue in this regard is the formalization of prior knowledge.
Such prior knowledge can be expressed by restricting the hypothesis class, making assumptions
on the nature of the unknown distribution $\mathcal{D}$ or the formalization of the data space, analytical
properties of the loss function being used to evaluate the performance, sparsity, and margin,
to name a few.

There has been an upsurge of interest over the last decade in finding tight upper bounds
on the sample complexity by utilizing prior knowledge on the analytical properties of the
loss function, which has led to stronger generalization bounds in the agnostic PAC setting. In [T7] fast
rates were obtained for the squared loss, exploiting the strong convexity of this loss function, which
only holds under a pseudo-dimensionality assumption. With the recent development in online
strongly convex optimization [TT], fast rates approaching $O\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\right)$ for convex Lipschitz
strongly convex loss functions have been obtained in [221 US]. For smooth non-negative loss
functions, [27] improved the sample complexity to optimistic rates



for non-parametric learning using the notion of local Rademacher complexity [3], where $\epsilon_{\mathrm{opt}}$
is the optimal risk.

In this work, we consider a slightly different setup for passive learning. We assume that
before the start of the learning process, the learner has in mind a target expected loss, also
referred to as target risk, denoted by $\epsilon_{\mathrm{prior}}$, and tries to learn a classifier with the expected
risk of $O(\epsilon_{\mathrm{prior}})$ by labeling a small number of training examples. We further assume the
target risk $\epsilon_{\mathrm{prior}}$ is feasible, i.e., $\epsilon_{\mathrm{prior}} \ge \epsilon_{\mathrm{opt}}$. To address this problem, we develop an efficient
algorithm, based on stochastic optimization, for passive learning with target risk. The most
surprising property of the proposed algorithm is that when the loss function is both smooth
and strongly convex, it only needs $O(d\log(1/\epsilon_{\mathrm{prior}}))$ labeled examples to find a classifier
with the expected risk of $O(\epsilon_{\mathrm{prior}})$, where $d$ is the dimension of the data. This is a significant
improvement compared to the sample complexity for empirical risk minimization.

The key intuition behind our algorithm is that by knowing the target risk as prior knowledge,
the learner has better control over the variance in stochastic gradients, which is the main
cause of the slow convergence in stochastic optimization and consequently of the large sample
complexity in passive learning. The trick is to run the stochastic optimization in multiple stages
of a fixed size and to decrease the variance of stochastically perturbed gradients at each iteration
by a properly designed mechanism. Another crucial feature of the proposed algorithm is
to utilize the target risk $\epsilon_{\mathrm{prior}}$ to gradually refine the hypothesis space as the algorithm proceeds.
Our algorithm differs significantly from standard stochastic optimization algorithms
and is able to achieve a geometric convergence rate with the knowledge of the target risk $\epsilon_{\mathrm{prior}}$.

We note that our work does not contradict the lower bound in [37] because a feasible
target risk $\epsilon_{\mathrm{prior}}$ is given in our learning setup and is fully exploited by the proposed algorithm.
Knowing that the target risk $\epsilon_{\mathrm{prior}}$ is feasible makes it possible to improve the sample
complexity from $O(1/\epsilon_{\mathrm{prior}})$ to $O(\log(1/\epsilon_{\mathrm{prior}}))$. We also note that although the logarithmic
sample complexity is known for active learning [101 [5], we are unaware of any existing passive
learning algorithm that is able to achieve a logarithmic sample complexity by incorporating
any kind of prior knowledge.



We use $\epsilon_{\mathrm{prior}}$ instead of $\epsilon$ to emphasize the fact that this parameter is known to the learner in advance.
1.1 More Related Work



Stochastic Optimization and Learnability Our work is related to the recent studies
that examined learnability from the viewpoint of stochastic convex optimization.
In [551 US], the authors presented learning problems that are learnable by stochastic convex
optimization but not by empirical risk minimization (ERM). Our work follows this line of
research. The proposed algorithm achieves the sample complexity of $O(d\log(1/\epsilon_{\mathrm{prior}}))$ by
explicitly incorporating the target expected risk $\epsilon_{\mathrm{prior}}$ into the stochastic convex optimization
algorithm. It is, however, difficult to incorporate such knowledge into the framework of ERM.
Furthermore, it is worth noting that in [23l [281 ESI H], the authors explored the connection
between online optimization and statistical learning in the opposite direction. This was done
by using the complexity measures developed in statistical learning to study the learnability of
online learning.

Online and Stochastic Optimization The proposed algorithm is closely related to the
recent works that stated $O(1/n)$ is the optimal convergence rate for stochastic optimization
when the objective function is strongly convex [T31[T^[^. In contrast, the proposed algorithm
is able to achieve a geometric convergence rate for a target optimization error. Similar to
the previous argument, our result does not contradict the lower bound given in [12] because
of the knowledge of a feasible optimization error. Moreover, in contrast to the multistage
algorithm in [12] where the size of stages increases exponentially, in our algorithm the size
of each stage is fixed to be a constant.

Outline The remainder of the paper is organized as follows: In Section 2 we set up notation,
describe the setting, and discuss the assumptions on which our algorithm relies.
Section 3 motivates the problem and discusses the main intuition of our algorithm. The proposed
algorithm and main result are discussed in Section 4. We prove the main result in Section 5.
Section 6 concludes the paper and the appendix contains the omitted proofs.

2 Preliminaries 

As usual in the framework of statistical learning theory, we consider a domain $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$
where $\mathcal{X} \subseteq \mathbb{R}^d$ is the space for instances and $\mathcal{Y}$ is the set of labels, and $\mathcal{H}$ is a hypothesis
class. We assume that the domain space $\mathcal{Z}$ is endowed with an unknown Borel probability
measure $\mathcal{D}$. We measure the performance of a specific hypothesis $h$ by defining a nonnegative
loss function $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}_+$. We denote the risk of a hypothesis $h$ by $\mathcal{L}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$.
Given a sample $S = (z_1, \ldots, z_n) = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \mathcal{D}^n$, the goal of a learning
algorithm is to pick a hypothesis $h : \mathcal{X} \to \mathcal{Y}$ from $\mathcal{H}$ in such a way that its risk $\mathcal{L}(h)$ is close
to the minimum possible risk of a hypothesis in $\mathcal{H}$.

Throughout this paper we pursue the stochastic optimization viewpoint for risk minimization
as detailed in Section 3. Precisely, we focus on convex learning problems for which we
assume that the hypothesis class $\mathcal{H}$ is a parametrized convex set $\mathcal{H} = \{h_w : x \mapsto \langle w, x\rangle : w \in \mathbb{R}^d, \|w\| \le R\}$
and for all $z = (x, y) \in \mathcal{Z}$, the loss function $\ell(\cdot, z)$ is a non-negative
convex function. Thus, in the remainder we simply use the vector $w$ to represent $h_w$, rather than
working with the hypothesis $h_w$. We will assume throughout that $\mathcal{X} \subseteq \mathbb{R}^d$ is the unit ball so
that $\|x\| \le 1$. Finally, the conditions under which we can get the desired result on sample
complexity depend on analytic properties of the loss function. In particular, we assume that
the loss function is strongly convex and smooth [2D].

Definition 1 (Strong convexity). A loss function $\ell(w)$ is said to be $\alpha$-strongly convex w.r.t.
a norm $\|\cdot\|$, if there exists a constant $\alpha > 0$ (often called the modulus of strong convexity)
such that, for any $\lambda \in [0, 1]$ and for all $w_1, w_2 \in \mathcal{H}$, it holds that

$$\ell(\lambda w_1 + (1 - \lambda) w_2) \le \lambda \ell(w_1) + (1 - \lambda)\ell(w_2) - \frac{1}{2}\lambda(1 - \lambda)\alpha\|w_1 - w_2\|^2.$$

When $\ell(w)$ is differentiable, the strong convexity is equivalent to

$$\ell(w_1) \ge \ell(w_2) + \langle\nabla\ell(w_2), w_1 - w_2\rangle + \frac{\alpha}{2}\|w_1 - w_2\|^2, \quad \forall\, w_1, w_2 \in \mathcal{H}.$$

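Definition 2 (Smoothness). A differentiable loss function $\ell(w)$ is said to be $\beta$-smooth with
respect to a norm $\|\cdot\|$, if it holds that

$$\ell(w_1) \le \ell(w_2) + \langle\nabla\ell(w_2), w_1 - w_2\rangle + \frac{\beta}{2}\|w_1 - w_2\|^2, \quad \forall\, w_1, w_2 \in \mathcal{H}. \quad (1)$$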

Such functions arise, for instance, in logistic and least-squares regression, and in general
for learning linear predictors where the loss function has a Lipschitz-continuous gradient.
Adding a strongly convex $\|\cdot\|^2$ regularizer to the mentioned smooth loss functions makes them
appropriate for our setting.
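
To make Definitions 1 and 2 concrete, the following minimal sketch checks both inequalities numerically for a regularized squared loss. The loss, the regularization weight `lam`, and the sample point are illustrative assumptions, not taken from the paper. For this loss the Hessian is $x x^\top + \lambda I$, so with $\|x\| = 1$ one may take $\alpha = \lambda$ and $\beta = \lambda + 1$.

```python
import random

def loss(w, x, y, lam):
    """Regularized squared loss 0.5*(<w,x> - y)^2 + (lam/2)*||w||^2 (illustrative choice)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (pred - y) ** 2 + 0.5 * lam * sum(wi * wi for wi in w)

def grad(w, x, y, lam):
    """Gradient of the regularized squared loss with respect to w."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [(pred - y) * xi + lam * wi for wi, xi in zip(w, x)]

def first_order_gap(w1, w2, x, y, lam):
    """Return loss(w1) - loss(w2) - <grad(w2), w1 - w2> and ||w1 - w2||^2."""
    g = grad(w2, x, y, lam)
    diff = [a - b for a, b in zip(w1, w2)]
    lin = sum(gi * di for gi, di in zip(g, diff))
    gap = loss(w1, x, y, lam) - loss(w2, x, y, lam) - lin
    return gap, sum(di * di for di in diff)

random.seed(0)
lam = 0.1
x, y = [0.6, -0.8], 1.0          # ||x|| = 1
alpha, beta = lam, lam + 1.0     # eigenvalue range of the Hessian x x^T + lam * I
checks = []
for _ in range(100):
    w1 = [random.uniform(-1, 1) for _ in range(2)]
    w2 = [random.uniform(-1, 1) for _ in range(2)]
    gap, sq = first_order_gap(w1, w2, x, y, lam)
    # Strong convexity gives the lower bound, smoothness the upper bound.
    checks.append(0.5 * alpha * sq - 1e-12 <= gap <= 0.5 * beta * sq + 1e-12)
all_ok = all(checks)
```

The gap between the loss and its first-order approximation is sandwiched between $\frac{\alpha}{2}\|w_1 - w_2\|^2$ and $\frac{\beta}{2}\|w_1 - w_2\|^2$, which is exactly what the two definitions require.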

3 The Curse of Stochastic Oracle 

We begin by discussing stochastic optimization for risk minimization, convex learnability,
and then the main intuition that motivates this work.

Most existing learning algorithms follow the framework of empirical risk minimization (ERM)
or regularized ERM, which was developed to a great extent by Vapnik and Chervonenkis [3D].
Essentially, ERM methods use the empirical loss over $S$, i.e., $\widehat{\mathcal{L}}(w) = \frac{1}{n}\sum_{i=1}^{n}\ell(w, z_i)$, as a
criterion to pick a hypothesis. In regularized ERM methods, the learner picks a hypothesis
that jointly minimizes $\widehat{\mathcal{L}}(w)$ and a regularization function over $w$. We note that ERM
resembles the widely used Sample Average Approximation (SAA) method in the optimization
community when the hypothesis space and the loss function are convex. If uniform convergence
holds, then the empirical risk minimizer is consistent, i.e., the population risk of the
ERM converges to the optimal population risk, and the problem is learnable using ERM.
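
As a small illustration of the ERM criterion described above, the sketch below evaluates the empirical loss on a sample and minimizes it in closed form. The scalar least-squares setting, the function names, and the three-point sample are assumptions made for illustration, and the norm constraint $\|w\| \le R$ is ignored for simplicity.

```python
def empirical_risk(w, sample, loss):
    """Average loss of hypothesis w over the sample S = (z_1, ..., z_n)."""
    return sum(loss(w, z) for z in sample) / len(sample)

def squared_loss(w, z):
    """Squared loss of the scalar linear predictor x -> w * x on z = (x, y)."""
    x, y = z
    return 0.5 * (w * x - y) ** 2

def erm_scalar_least_squares(sample):
    """Closed-form minimizer of the empirical squared loss for a scalar w."""
    num = sum(x * y for x, y in sample)
    den = sum(x * x for x, _ in sample)
    return num / den

sample = [(1.0, 2.0), (0.5, 1.0), (-1.0, -2.0)]   # generated by y = 2 * x
w_erm = erm_scalar_least_squares(sample)
risk_at_erm = empirical_risk(w_erm, sample, squared_loss)
```

On this noiseless sample the empirical risk minimizer recovers the generating slope and attains zero empirical loss.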

A rather different paradigm for risk minimization is stochastic optimization. Recall that
the goal of learning is to approximately minimize the risk $\mathcal{L}(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$. However,
since the distribution $\mathcal{D}$ is unknown to the learner, we cannot utilize standard gradient methods
to minimize the expected loss. Stochastic optimization methods circumvent this problem
by allowing the optimization method to take a step which is only in expectation along the
negative of the gradient. To motivate stochastic optimization as an alternative to the ERM
method, [2S1 [21] challenged the ERM method and showed that there is a real gap between
learnability and uniform convergence by investigating non-trivial problems where no uniform
convergence holds, but which are still learnable using the Stochastic Gradient Descent (SGD)
algorithm [18]. These results uncovered an important relationship between learnability and
stability, and showed that stability, together with approximate empirical risk minimization,
assures learnability. We note that Lipschitzness or smoothness of the loss function is necessary
for an algorithm to be stable, and that boundedness and convexity alone are not sufficient
for ensuring that the convex learning problem is learnable.

To directly solve $\min_{w \in \mathcal{H}} \mathcal{L}(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$, a typical stochastic optimization
algorithm initially picks some point in the feasible set $\mathcal{H}$ and iteratively updates this point
based on first-order perturbed gradient information about the function at that point. For
instance, the widely used SGD algorithm starts with $w_0 = 0$; at each iteration $t$, it queries
the stochastic oracle ($\mathcal{SO}$) at $w_t$ to obtain a perturbed but unbiased gradient $g_t$ and updates
the current solution by

$$w_{t+1} = \Pi_{\mathcal{H}}(w_t - \eta_t g_t),$$

where $\Pi_{\mathcal{H}}(w)$ projects the solution $w$ into the domain $\mathcal{H}$. To capture the efficiency of
optimization procedures in a general sense, one can use the oracle complexity of the algorithm which,
roughly speaking, is the minimum number of calls to any oracle needed by any method to
achieve the desired accuracy [5D]. We note that the oracle complexity corresponds to the sample
complexity of learning from the stochastic optimization viewpoint previously discussed. The
following theorem states a lower bound on the sample complexity of stochastic optimization
algorithms [TO].

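Theorem 3 (Lower Bound on Oracle Complexity). Suppose $\mathcal{L}(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$ is an
$\alpha$-strongly convex and $\beta$-smooth function defined over a convex domain $\mathcal{H}$. Let $\mathcal{SO}$ be a stochastic
oracle that for any point $w \in \mathcal{H}$ returns an unbiased estimate $g$, i.e., $\mathbb{E}[g] = \nabla\mathcal{L}(w)$, such
that $\mathbb{E}[\|g - \nabla\mathcal{L}(w)\|^2] \le \sigma^2$ holds. Then for any stochastic optimization algorithm $\mathcal{A}$ to find
a solution $\widehat{w}$ with $\epsilon$ accuracy with respect to the optimal solution $w_*$, i.e., $\mathbb{E}[\mathcal{L}(\widehat{w}) - \mathcal{L}(w_*)] \le \epsilon$,
the number of calls to $\mathcal{SO}$ is lower bounded by

$$\Omega(1)\left(\sqrt{\frac{\beta}{\alpha}}\,\log\frac{\beta\|w_0 - w_*\|^2}{\epsilon} + \frac{\sigma^2}{\alpha\epsilon}\right). \quad (2)$$
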
The first term in $\Omega$ comes from the deterministic oracle complexity and the second term is
due to the noisy gradient information provided by $\mathcal{SO}$. As indicated in (2), the slow convergence
rate for stochastic optimization is due to the variance in stochastic gradients, leading to at
least $O(\sigma^2/\epsilon)$ queries to be issued. We note that the idea of mini-batching [3 [5], although it
reduces the variance in stochastic gradients, does not reduce the oracle complexity.
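
The projected SGD update discussed in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the feasible set is a Euclidean ball, the oracle is a hypothetical function `grad_oracle(w, t)`, and the toy objective $\mathbb{E}[\frac{1}{2}\|w - z\|^2]$ with step size $1/t$ is chosen only because its iterates are easy to verify by hand.

```python
import math

def project_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w|| <= radius}."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= radius:
        return list(w)
    return [radius * wi / norm for wi in w]

def projected_sgd(grad_oracle, d, radius, step, n_steps):
    """Run w <- Proj(w - eta_t * g_t), starting from the zero vector."""
    w = [0.0] * d
    for t in range(1, n_steps + 1):
        g = grad_oracle(w, t)        # perturbed but unbiased gradient
        eta = step(t)
        w = project_ball([wi - eta * gi for wi, gi in zip(w, g)], radius)
    return w

# Toy problem: the stochastic gradient of 0.5 * ||w - z_t||^2 at w is w - z_t.
zs = [[0.2, -0.1], [0.4, 0.3], [0.0, 0.1], [0.6, -0.3]]

def oracle(w, t):
    return [wi - zi for wi, zi in zip(w, zs[t - 1])]

w_out = projected_sgd(oracle, d=2, radius=1.0, step=lambda t: 1.0 / t,
                      n_steps=len(zs))
```

With step size $1/t$ on this toy objective, the iterate after $t$ steps equals the running mean of $z_1, \ldots, z_t$, so the error decays only at the slow rate dictated by the variance of the samples.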

We close this section by informally presenting why logarithmic sample complexity is, in
principle, possible under the assumption that the target risk is known to the learner $\mathcal{A}$. To this
end, consider the setting of Theorem 3 and assume that the learner $\mathcal{A}$ is given the prior
accuracy $\epsilon_{\mathrm{prior}}$ and is asked to find an $\epsilon_{\mathrm{prior}}$-accurate solution. If it happens that the variance
of $\mathcal{SO}$ has the same magnitude as $\epsilon_{\mathrm{prior}}$, i.e., $\mathbb{E}[\|g - \nabla\mathcal{L}(w)\|^2] \le \epsilon_{\mathrm{prior}}$, then from (2) it
follows that the second term is bounded by a constant and the learner $\mathcal{A}$ needs to issue only $O(\log 1/\epsilon_{\mathrm{prior}})$
queries to find the solution. But, since there is no control on $\mathcal{SO}$, except that the variance of
stochastic gradients is bounded, $\mathcal{A}$ needs a mechanism to manage the variance of perturbed
gradients at each iteration in order to alleviate the influence of noisy gradients. One strategy
is to replace the unbiased estimate of the gradient with a biased one, which unfortunately may
yield loose bounds. To overcome this problem, we introduce a strategy that shrinks the
solution space with respect to the target risk $\epsilon_{\mathrm{prior}}$ to control the damage caused by biased
estimates.
4 Algorithm and Main Result



In this section we proceed to describe the proposed algorithm and state the main result on
its sample complexity.

4.1 Description of Algorithm 

We now turn to describing our algorithm. Interestingly, our algorithm is quite dissimilar
to the classic stochastic optimization methods. It proceeds by running the algorithm online
on fixed chunks of examples, and using the intermediate hypotheses and target risk $\epsilon_{\mathrm{prior}}$ to
gradually refine the hypothesis space. As mentioned above, we assume in our setting that the
target expected risk $\epsilon_{\mathrm{prior}}$ is provided to the learner a priori. We further assume the target
risk $\epsilon_{\mathrm{prior}}$ is feasible for the solution within the domain $\mathcal{H}$, i.e., $\epsilon_{\mathrm{prior}} \ge \epsilon_{\mathrm{opt}}$. The proposed
algorithm explicitly takes advantage of the knowledge of the expected risk $\epsilon_{\mathrm{prior}}$ to attain an
$O(\log(1/\epsilon_{\mathrm{prior}}))$ sample complexity.

Throughout we shall consider linear predictors of the form $\langle w, x\rangle$ and assume that the loss
function of interest $\ell(\langle w, x\rangle, y)$ is $\beta$-smooth. It is straightforward to see that $\mathcal{L}(w) =
\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(\langle w, x\rangle, y)]$ is also $\beta$-smooth. In addition to the smoothness of the loss function,
we also assume $\mathcal{L}(w)$ to be $\alpha$-strongly convex. We denote by $w_*$ the optimal solution
that minimizes $\mathcal{L}(w)$, i.e., $w_* = \arg\min_{w \in \mathcal{H}} \mathcal{L}(w)$, and denote its optimal value by $\epsilon_{\mathrm{opt}}$.

Let $(x_t, y_t)$, $t = 1, \ldots, T$, be a sequence of i.i.d. training examples. The proposed algorithm
divides the $T$ iterations into $m$ stages, where each stage consists of $T_1$ training examples,
i.e., $T = mT_1$. Let $(x_k^t, y_k^t)$ be the $t$-th training example received at stage $k$, and let $\eta$ be the
step size used by all the stages. At the beginning of each stage $k$, we initialize the solution
$w$ by the average solution $\widehat{w}_k$ obtained from the last stage, i.e.,

$$\widehat{w}_k = \frac{1}{T_1}\sum_{t=1}^{T_1} w_{k-1}^t. \quad (3)$$

Another feature of the proposed algorithm is a domain shrinking strategy that adjusts the
domain as the algorithm proceeds using intermediate hypotheses and target risk. We define
the domain $\mathcal{H}_k$ used at stage $k$ as

$$\mathcal{H}_k = \{w \in \mathcal{H} : \|w - \widehat{w}_k\| \le \Delta_k\}, \quad (4)$$

where $\Delta_k$ is the domain size, whose value will be discussed later. Similar to the SGD method,
at each iteration of stage $k$, we receive a training example $(x_k^t, y_k^t)$ and compute the gradient
$g_k^t = \ell'(\langle w_k^t, x_k^t\rangle, y_k^t)\, x_k^t$. Instead of using the gradient directly, following [13], a clipped
version of the gradient, denoted by $v_k^t = \mathrm{clip}(\gamma_k, g_k^t)$, will be used for updating the solution.
More specifically, the clipped vector $v_k^t \in \mathbb{R}^d$ is defined as

$$[v_k^t]_i = \mathrm{clip}(\gamma_k, [g_k^t]_i) = \mathrm{sign}([g_k^t]_i)\min(\gamma_k, |[g_k^t]_i|), \quad i = 1, \ldots, d \quad (5)$$

where $\gamma_k = 2\xi\beta\Delta_k$ with $\xi \ge 1$. Given the clipped gradient $v_k^t$, we follow the standard
framework of stochastic gradient descent, and update the solution by

$$w_k^{t+1} = \Pi_{\mathcal{H}_k}(w_k^t - \eta v_k^t). \quad (6)$$
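
The coordinate-wise clipping operator of Eq. (5) is simple enough to state directly in code. The sketch below is a minimal illustration. The function name `clip` mirrors the notation of the text and the test vector is an arbitrary assumption.

```python
import math

def clip(gamma, g):
    """Coordinate-wise clipping: sign(g_i) * min(gamma, |g_i|) for each i."""
    return [math.copysign(min(gamma, abs(gi)), gi) for gi in g]

clipped = clip(1.0, [3.0, -0.5, -2.0, 0.0])
# Each coordinate is bounded by gamma, so the norm is at most gamma * sqrt(d).
clipped_norm = math.sqrt(sum(vi * vi for vi in clipped))
```

Because every coordinate of the clipped vector has magnitude at most $\gamma_k$, its Euclidean norm is at most $\gamma_k\sqrt{d}$, which is the property used later in the analysis.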






Algorithm 1 Convex Learning with Target Risk
1: Input: step size $\eta$, stage size $T_1$, number of stages $m$, target expected risk $\epsilon_{\mathrm{prior}}$,
   parameters $\varepsilon \in (0, 1)$ and $\tau \in (0, 1)$ used for updating domain size $\Delta_k$, and parameter $\xi \ge 1$
   used to clip the gradients
2: Initialization: $\widehat{w}_1 = 0$, $\Delta_1 = R$, and $\mathcal{H}_1 = \mathcal{H}$
3: for $k = 1, \ldots, m$ do
4:    Set $w_k^1 = \widehat{w}_k$ and $\gamma_k = 2\xi\beta\Delta_k$
5:    for $t = 1, \ldots, T_1$ do
6:       Receive training example $(x_k^t, y_k^t)$
7:       Compute the gradient $g_k^t$ and the clipped version of the gradient $v_k^t$ using Eq. (5)
8:       Update the solution $w_k^t$ using Eq. (6)
9:    end for
10:   Update $\Delta_k$ using Eq. (7)
11:   Compute the average solution $\widehat{w}_{k+1}$ according to Eq. (3) and update the domain $\mathcal{H}_{k+1}$
      using the expression in (4)
12: end for
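
The steps of Algorithm 1 can be sketched in Python as below. This is an illustrative sketch rather than the authors' implementation, and it rests on several assumptions: the domain-size update is taken to be $\Delta_{k+1}^2 = \varepsilon\Delta_k^2 + (2\tau/\beta)\epsilon_{\mathrm{prior}}$, the projection onto $\mathcal{H}_k$ is approximated by alternating projections onto the two balls, and the parameter values in the toy run are chosen for readability and do not follow the settings required by the theory. The names `target_risk_learner`, `project_domain`, and `dloss` are hypothetical.

```python
import math

def clip(gamma, g):
    """Coordinate-wise clipping of Eq. (5)."""
    return [math.copysign(min(gamma, abs(gi)), gi) for gi in g]

def project_ball(w, center, radius):
    """Euclidean projection onto the ball of given center and radius."""
    diff = [wi - ci for wi, ci in zip(w, center)]
    norm = math.sqrt(sum(di * di for di in diff))
    if norm <= radius:
        return list(w)
    return [ci + radius * di / norm for ci, di in zip(center, diff)]

def project_domain(w, w_hat, delta, R, sweeps=50):
    """Approximate projection onto {||w|| <= R} intersected with
    {||w - w_hat|| <= delta} by alternating projections (sketch assumption)."""
    origin = [0.0] * len(w)
    for _ in range(sweeps):
        w = project_ball(project_ball(w, origin, R), w_hat, delta)
    return w

def target_risk_learner(examples, dloss, beta, R, eta, T1, m,
                        eps_prior, eps, tau, xi):
    d = len(examples[0][0])
    stream = iter(examples)
    w_hat, delta = [0.0] * d, R                  # w_hat_1 = 0, Delta_1 = R
    for _ in range(m):
        gamma = 2.0 * xi * beta * delta          # clipping threshold gamma_k
        w, total = list(w_hat), [0.0] * d        # w_k^1 = w_hat_k
        for _ in range(T1):
            x, y = next(stream)
            total = [s + wi for s, wi in zip(total, w)]   # accumulate w_k^t
            margin = sum(wi * xj for wi, xj in zip(w, x))
            g = [dloss(margin, y) * xj for xj in x]       # gradient g_k^t
            v = clip(gamma, g)                            # clipped gradient
            step = [wi - eta * vi for wi, vi in zip(w, v)]
            w = project_domain(step, w_hat, delta, R)     # Eq. (6)
        w_hat = [s / T1 for s in total]                   # average solution
        # Assumed form of the domain-size update.
        delta = math.sqrt(eps * delta ** 2 + (2.0 * tau / beta) * eps_prior)
    return w_hat

# Toy run: noiseless squared loss 0.5 * (z - y)^2, whose derivative in z is z - y.
w_star = [0.5, -0.3]
basis = [[1.0, 0.0], [0.0, 1.0]]
data = [(basis[i % 2], w_star[i % 2]) for i in range(100)]
w_out = target_risk_learner(data, dloss=lambda z, y: z - y, beta=1.0, R=1.0,
                            eta=0.5, T1=20, m=5, eps_prior=0.01,
                            eps=0.5, tau=0.5, xi=2.0)
```

In this toy run the stage averages approach $w_*$ faster than the domain shrinks, so neither the clipping nor the projection is ever active. In general, the stage size and step size would need to follow Theorem 4 for the guarantee to apply.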



The purpose of introducing the clipped version of the gradient is to effectively control the
variance in stochastic gradients, an important step toward achieving the geometric convergence
rate. At the end of each stage, we will update the domain size by explicitly exploiting
the target expected risk $\epsilon_{\mathrm{prior}}$ as

$$\Delta_{k+1}^2 = \varepsilon\Delta_k^2 + \frac{2\tau}{\beta}\epsilon_{\mathrm{prior}}, \quad (7)$$

where $\varepsilon \in (0, 1)$ and $\tau \in (0, 1)$ are two parameters, both of which will be discussed later.

Algorithm 1 gives the detailed steps for the proposed method. The three important
aspects of Algorithm 1, all crucial to achieving a geometric convergence rate, are highlighted
as follows:

• Each stage of the proposed algorithm is comprised of the same number of training
examples. This is in contrast to the epoch gradient algorithm [12], which divides the
iterations into exponentially increasing epochs, and runs SGD with averaging on each
epoch. Also, in our case the learning rate is fixed for all iterations.

• The proposed algorithm uses a clipped gradient for updating the solution in order to
better control the variance in stochastic gradients; this stands in contrast to the SGD
method, which uses original gradients to update the solution.

• The proposed algorithm takes into account the target expected risk and intermediate
hypotheses when updating the domain size at each stage. The purpose of domain
shrinking is to reduce the damage caused by the biased gradients that result from the clipping
operation.

4.2 Main Result on Sample Complexity 

The main theoretical result for Algorithm 1 is given in the following theorem.




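Theorem 4. Assume that the hypothesis space $\mathcal{H}$ is compact and the loss function $\ell$ is
$\alpha$-strongly convex and $\beta$-smooth. Let $T = mT_1$ be the size of the sample and $\epsilon_{\mathrm{prior}}$ be the target
expected loss given to the learner in advance such that $\epsilon_{\mathrm{opt}} \le \epsilon_{\mathrm{prior}}$ holds. Given $\varepsilon \in (0, 1)$
and $\tau \in (0, 1)$, set $\xi$, $\eta$, and $T_1$ as

$$\xi = \frac{4\beta}{\alpha\tau}, \quad \eta = \frac{1}{2\xi\beta\sqrt{T_1}}, \quad T_1 = 4\max\left\{\frac{\xi^2\beta d + 2\xi\beta\sqrt{d}}{\varepsilon\alpha}\ln\frac{ms}{\delta},\ \frac{16\xi^2\beta^2}{\alpha^2\varepsilon^2}\right\},$$

where

$$s = \log_2\frac{\ldots}{\epsilon_{\mathrm{prior}}}.$$

After running Algorithm 1 over $m$ stages, we have, with a probability $1 - \delta$,

$$\mathcal{L}(\widehat{w}_{m+1}) \le \frac{\beta R^2}{2}\varepsilon^m + \left(1 + \frac{\tau}{1 - \varepsilon}\right)\epsilon_{\mathrm{prior}}, \quad (8)$$

implying that only $O(d\log[1/\epsilon_{\mathrm{prior}}])$ training examples are needed in order to achieve a risk
of $O(\epsilon_{\mathrm{prior}})$.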
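
To see where the logarithmic dependence comes from, the following sketch counts the number of stages needed for a geometrically decaying leading term to fall below the target risk. It assumes the leading term has the form $\frac{\beta R^2}{2}\varepsilon^m$. The function name `stages_needed` and the parameter values are illustrative assumptions.

```python
def stages_needed(beta, R, eps_prior, eps):
    """Smallest m such that (beta * R**2 / 2) * eps**m <= eps_prior."""
    term, m = 0.5 * beta * R ** 2, 0
    while term > eps_prior:
        term *= eps      # one more stage shrinks the leading term by eps
        m += 1
    return m

m_coarse = stages_needed(beta=1.0, R=1.0, eps_prior=0.01, eps=0.5)
m_fine = stages_needed(beta=1.0, R=1.0, eps_prior=0.0001, eps=0.5)
```

Tightening the target risk by a factor of 100 adds only a handful of stages, since $m$ grows like $\log(1/\epsilon_{\mathrm{prior}})/\log(1/\varepsilon)$. With a fixed stage size, the total number of examples $T = mT_1$ inherits this logarithmic growth, up to the logarithmic factors inside $T_1$.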

Remark 5. The result given in Theorem 4 assumes a bounded domain with $\|w\| \le R$.
This assumption can be lifted by effectively exploiting the strong convexity of the loss function
and further assuming that the loss function is Lipschitz continuous with constant $G$, i.e.,
$|\mathcal{L}(w_1) - \mathcal{L}(w_2)| \le G\|w_1 - w_2\|$, $\forall\, w_1, w_2 \in \mathcal{H}$. More specifically, since $\mathcal{L}(w)$
is $\alpha$-strongly convex, the first-order optimality condition for the optimal solution $w_* =
\arg\min_{w \in \mathcal{H}} \mathcal{L}(w)$ gives

$$\mathcal{L}(w) - \mathcal{L}(w_*) \ge \frac{\alpha}{2}\|w - w_*\|^2, \quad \forall\, w \in \mathcal{H}.$$

This inequality combined with the Lipschitz continuity assumption implies that for any $w \in \mathcal{H}$
the inequality $\|w - w_*\| \le R_* := \frac{2G}{\alpha}$ holds, and therefore we can simply set $R = R_*$.

We note this dependency can also be resolved with a weaker assumption than Lipschitz
continuity, which only depends on the gradient of the loss function at the origin. To this end, we
let $G$ be a bound on $|\ell'(0, y)|$. Using the fact that $\mathcal{L}(w)$ is $\alpha$-strongly convex, it is easy to verify that

$$\frac{\alpha}{2}\|w_*\|^2 - G\|w_*\| \le 0,$$

leading to $\|w_*\| \le R_* := \frac{2G}{\alpha}$ and, therefore, we can simply set $R = R_*$.
Remark 6. It is worth noting that the sample complexity of Theorem 4 has a dependency
on the condition number $\beta/\alpha$ of the loss function that is worse than the $\sqrt{\beta/\alpha}$
dependency in the lower bound in (2). Also, the explicit dependency of the sample complexity on the
dimension $d$ makes the proposed algorithm inappropriate for non-parametric settings.
5 Analysis



Now we turn to proving the main theorem. The proof will be given in a series of lemmas
and theorems, where the proofs of a few are given in the appendix. The proof makes use of
the Bernstein inequality for martingales, the idea of the peeling process, the self-bounding property of
smooth loss functions, standard analysis of stochastic optimization, and novel ideas to derive
the claimed sample complexity for the proposed algorithm.

The proof of Theorem 4 is by induction and we start with the key step given in the
following theorem.

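Theorem 7. Assume $\epsilon_{\mathrm{prior}} \ge \epsilon_{\mathrm{opt}}$. For a fixed stage $k$, if $\|\widehat{w}_k - w_*\| \le \Delta_k$, then, with a
probability $1 - \delta$, we have

$$\|\widehat{w}_{k+1} - w_*\|^2 \le a\Delta_k^2 + b\,\epsilon_{\mathrm{prior}}$$

where

$$a = \frac{2}{\alpha T_1}\left(2\xi\beta\sqrt{T_1} + \left[\xi^2\beta d + 2\xi\beta\sqrt{d}\right]\ln\frac{s}{\delta}\right), \quad b = \frac{8}{\alpha\xi}, \quad (9)$$

and $s$ is the logarithmic factor defined in Lemma 1, provided that $\xi \ge 16\beta/\alpha$ and $\eta = 1/(2\xi\beta\sqrt{T_1})$ hold.

For now, we proceed with the proof of Theorem 4, returning later to establish the claim
stated in Theorem 7.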

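Proof of Theorem 4. By setting $a$ and $b$ in (9) in Theorem 7 as $a \le \varepsilon$ and $b \le 2\tau/\beta$, we have
$\xi \ge 4\beta/(\alpha\tau)$ and

$$T_1 \ge \frac{2}{\alpha\varepsilon}\left(2\xi\beta\sqrt{T_1} + \left[\xi^2\beta d + 2\xi\beta\sqrt{d}\right]\ln\frac{s}{\delta}\right),$$

implying that

$$T_1 \ge 4\max\left\{\frac{\xi^2\beta d + 2\xi\beta\sqrt{d}}{\varepsilon\alpha}\ln\frac{s}{\delta},\ \frac{16\xi^2\beta^2}{\alpha^2\varepsilon^2}\right\}.$$

Thus, using Theorem 7 and the definition of $\xi$ and $T_1$, we have, with a probability $1 - \delta$,

$$\Delta_{k+1}^2 \le \varepsilon\Delta_k^2 + \frac{2\tau}{\beta}\epsilon_{\mathrm{prior}}.$$

After $m$ stages, with a probability $1 - m\delta$, we have

$$\Delta_{m+1}^2 \le \varepsilon^m\Delta_1^2 + \frac{2\tau}{\beta}\epsilon_{\mathrm{prior}}\sum_{i=0}^{m-1}\varepsilon^i \le \varepsilon^m\Delta_1^2 + \frac{2\tau}{\beta(1 - \varepsilon)}\epsilon_{\mathrm{prior}}.$$

By the $\beta$-smoothness of $\mathcal{L}(w)$ it implies that

$$\mathcal{L}(\widehat{w}_{m+1}) - \mathcal{L}(w_*) \le \frac{\beta}{2}\Delta_{m+1}^2 \le \frac{\beta R^2}{2}\varepsilon^m + \frac{\tau}{1 - \varepsilon}\epsilon_{\mathrm{prior}},$$

where the last inequality follows from $\Delta_1 \le R$. The bound stated in the theorem follows from
the assumption that $\mathcal{L}(w_*) = \epsilon_{\mathrm{opt}} \le \epsilon_{\mathrm{prior}}$. □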
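
The unrolling step in the proof above is a generic fact about linear recursions of the form $x_{k+1} = \varepsilon x_k + c$ with $\varepsilon \in (0, 1)$. The sketch below iterates such a recursion and compares each iterate against the closed-form bound $\varepsilon^k x_1 + c/(1 - \varepsilon)$. The numerical values of `eps`, `c`, and the starting point are arbitrary assumptions.

```python
def unroll(x1, eps, c, m):
    """Iterate x_{k+1} = eps * x_k + c for m steps, starting from x1."""
    xs = [x1]
    for _ in range(m):
        xs.append(eps * xs[-1] + c)
    return xs

eps, c, x1 = 0.5, 0.02, 1.0
xs = unroll(x1, eps, c, m=10)
# Geometric-sum bound: eps^k * x1 + c / (1 - eps) after k steps.
bounds = [eps ** k * x1 + c / (1.0 - eps) for k in range(len(xs))]
holds = all(x <= b + 1e-12 for x, b in zip(xs, bounds))
```

The first term decays geometrically while the second stays at the constant level $c/(1 - \varepsilon)$, which plays the role of the $O(\epsilon_{\mathrm{prior}})$ floor in the theorem.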






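5.1 Proof of Theorem 7

To bound $\|\widehat{w}_{k+1} - w_*\|$ in terms of $\Delta_k$, we start with the standard analysis of online learning.
In particular, from the strong convexity assumption of $\mathcal{L}(w)$ and the updating rule (6) we have

$$\mathcal{L}(w_k^t) - \mathcal{L}(w_*) \le \langle\nabla\mathcal{L}(w_k^t), w_k^t - w_*\rangle - \frac{\alpha}{2}\|w_k^t - w_*\|^2$$

$$= \langle v_k^t, w_k^t - w_*\rangle + \langle\nabla\mathcal{L}(w_k^t) - v_k^t, w_k^t - w_*\rangle - \frac{\alpha}{2}\|w_k^t - w_*\|^2$$

$$\le \frac{\|w_k^t - w_*\|^2 - \|w_k^{t+1} - w_*\|^2}{2\eta} + \frac{\eta}{2}\gamma_k^2 d + \langle\nabla\mathcal{L}(w_k^t) - v_k^t, w_k^t - w_*\rangle - \frac{\alpha}{2}\|w_k^t - w_*\|^2, \quad (10)$$

where the last step follows from $\|v_k^t\| \le \gamma_k\sqrt{d}$. By adding all the inequalities of (10) at stage
$k$, and using $\|w_k^1 - w_*\| = \|\widehat{w}_k - w_*\| \le \Delta_k$, we have

$$\sum_{t=1}^{T_1}\mathcal{L}(w_k^t) - \mathcal{L}(w_*) \le \frac{\Delta_k^2}{2\eta} + \frac{\eta}{2}\gamma_k^2 d\,T_1 + \sum_{t=1}^{T_1}\langle\nabla\mathcal{L}(w_k^t) - v_k^t, w_k^t - w_*\rangle - \frac{\alpha}{2}\sum_{t=1}^{T_1}\|w_k^t - w_*\|^2 = \frac{\Delta_k^2}{2\eta} + \frac{\eta}{2}\gamma_k^2 d\,T_1 + V_k - \frac{\alpha}{2}W_k, \quad (11)$$

where $V_k$ and $W_k$ are defined as $V_k = \sum_{t=1}^{T_1}\langle\nabla\mathcal{L}(w_k^t) - v_k^t, w_k^t - w_*\rangle$ and $W_k = \sum_{t=1}^{T_1}\|w_k^t - w_*\|^2$, respectively. In
order to bound $V_k$, using the fact that $\nabla\mathcal{L}(w_k^t) = \mathbb{E}_t[g_k^t]$, we rewrite $V_k$ as

$$V_k = \sum_{t=1}^{T_1}\langle\mathbb{E}_t[v_k^t] - v_k^t, w_k^t - w_*\rangle + \sum_{t=1}^{T_1}\langle\mathbb{E}_t[g_k^t] - \mathbb{E}_t[v_k^t], w_k^t - w_*\rangle = D_k + E_k,$$

where $D_k = \sum_{t=1}^{T_1} d_k^t$ and $E_k = \sum_{t=1}^{T_1} e_k^t$, which represent the variance and bias of the clipped
gradient $v_k^t$, respectively. We now turn to separately upper bounding each term.

The following lemma bounds the variance term $D_k$ using the Bernstein inequality for
martingales. Its proof can be found in Appendix A.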

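Lemma 1. For any $L > 0$ and $\mu > 0$, we have

$$\Pr\left(W_k \le \frac{\epsilon_{\mathrm{prior}}T_1}{2\beta\mu}\right) + \Pr\left(D_k \le \frac{1}{L}W_k + \left(L\gamma_k^2 d + \gamma_k\Delta_k\sqrt{d}\right)\ln\frac{s}{\delta}\right) \ge 1 - \delta$$

where $s$ is given by

$$s = \left\lceil\log_2\frac{8\beta\mu R^2}{\epsilon_{\mathrm{prior}}}\right\rceil.$$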

The following lemma bounds $E_k$ using the self-bounding property of smooth functions
and the proof is deferred to Appendix B.

Lemma 2.

$$E_k \le \frac{4T_1}{\xi}\epsilon_{\mathrm{opt}} + \frac{4\beta}{\xi}W_k \le \frac{4T_1}{\xi}\epsilon_{\mathrm{prior}} + \frac{4\beta}{\xi}W_k.$$

Note that without the knowledge of $\epsilon_{\mathrm{prior}}$, we would have to bound $\epsilon_{\mathrm{opt}}$ by a constant, resulting in a
very loose bound for the bias term $E_k$. It is the knowledge of the target expected risk $\epsilon_{\mathrm{prior}}$ that
allows us to come up with a significantly more accurate bound for the bias term $E_k$, which
consequently leads to a geometric convergence rate.

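We now proceed to bound $\sum_{t=1}^{T_1}\mathcal{L}(w_k^t) - \mathcal{L}(w_*)$ using the two bounds in Lemmas 1 and
2. To this end, based on the result obtained in Lemma 1, we consider two scenarios. In the
first scenario, we assume

$$W_k \le \frac{\epsilon_{\mathrm{prior}}T_1}{2\beta\mu}. \quad (12)$$

In this case, we have

$$\sum_{t=1}^{T_1}\mathcal{L}(w_k^t) - \mathcal{L}(w_*) \le \frac{\beta}{2}W_k \le \frac{\epsilon_{\mathrm{prior}}}{4\mu}T_1. \quad (13)$$

In the second scenario, we assume

$$D_k \le \frac{1}{L}W_k + \left(L\gamma_k^2 d + \gamma_k\Delta_k\sqrt{d}\right)\ln\frac{s}{\delta}. \quad (14)$$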
In this case, by combining the bounds for Dk and Ek and setting L ^ j^, we have 

Vk < |M/,+ (|7.^+7.A.x/rf)ln^ + ^W 



^Wk + {ef^d + 2e/3^/d) A? In ^ + 



where the last equality follows from the fact ^k — 25/3Afc. If we choose ^ such that ^ < ^ 
or C > — > 1 holds, we get 



Vk < ^Wk + [ei3d + 2^/3Vd) Al In ^ + ^e^. 



Substituting the above bound for Vk into the inequality of (jlip . we have 

5^ C{wi) - /:(w.) < ^ + ^jIT, + [epd + 2^/3Vd) Al\n^ + ^CpHor 

By choosing r] as rj = = j^^^, we have 

£(wfc+i) - £(w,) < ^ (2^PVt'i + [eiid + 2^l3Vd] In ^) A^ + ^/:(w,). (15) 
By combining the bounds in (fT5|) and ([T5)) . under the assumption that at least one of the two
conditions in and (|14p is true, by setting ^ = B/8, we have 

C{^k+i) - C{w*) < ^ (2^13VTi + [ei3d + 2e/3Vd] In ^) + |£(wO, 
implying 



|Wfe+i - wJI < 



In- 



A2+ /:(w,). 



We complete the proof by using Lemma 1, which states that the probability that at least one of the
two conditions holds is no less than $1 - \delta$.



6 Conclusions 

In this paper, we have studied the sample complexity of passive learning when the target
expected risk is given to the learner as prior knowledge. The crucial fact about the target
risk assumption is that it can be fully exploited by the learning algorithm. This stands in
contrast to most common types of prior knowledge, which usually enter into the generalization
bounds and are often perceived as a rather crude way to incorporate such assumptions. We
showed that by explicitly employing the target risk $\epsilon_{\mathrm{prior}}$ in a properly designed stochastic
optimization algorithm, it is possible to attain the given target risk $\epsilon_{\mathrm{prior}}$ with a logarithmic
sample complexity $O\left(\log\frac{1}{\epsilon_{\mathrm{prior}}}\right)$, under the assumption that the loss function is both strongly
convex and smooth.

There are various directions for future research. The current study is restricted to the
parametric setting where the hypothesis space is of finite dimension. It would be interesting
to see how to achieve a logarithmic sample complexity in a non-parametric setting where
hypotheses lie in a functional space of infinite dimension. Evidently, it is impossible to extend
the current algorithm to the non-parametric setting; therefore additional analysis tools are
needed to address the challenge of infinite dimension arising from the non-parametric setting.



A Proof of Lemma 1

The proof is based on the Bernstein inequality for martingales (see, e.g., [3]). 

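Lemma 3. (Bernstein inequality for martingales). Let $X_1, \ldots, X_n$ be a bounded martingale
difference sequence with respect to the filtration $\mathcal{F} = (\mathcal{F}_i)_{1 \le i \le n}$ and with $\|X_i\| \le K$. Let
$S_i = \sum_{j=1}^{i} X_j$ be the associated martingale. Denote the sum of the conditional variances by

$$\Sigma_n^2 = \sum_{t=1}^{n}\mathbb{E}\left[X_t^2 \mid \mathcal{F}_{t-1}\right].$$

Then for all constants $t, \nu > 0$,

$$\Pr\left[\max_{i=1,\ldots,n} S_i > t \text{ and } \Sigma_n^2 \le \nu\right] \le \exp\left(-\frac{t^2}{2(\nu + Kt/3)}\right),$$

and therefore,

$$\Pr\left[\max_{i=1,\ldots,n} S_i > \sqrt{2\nu t} + \frac{\sqrt{2}}{3}Kt \text{ and } \Sigma_n^2 \le \nu\right] \le e^{-t}.$$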



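Proof of Lemma 1. Define the martingale difference $d_k^t = \langle w_k^t - w_*, \mathbb{E}_t[v_k^t] - v_k^t\rangle$ and the martingale
$D_k = \sum_{t=1}^{T_1} d_k^t$. Let $\Sigma_T^2$ denote the conditional variance,

$$\Sigma_T^2 = \sum_{t=1}^{T_1}\mathbb{E}_t\left[(d_k^t)^2\right] \le \gamma_k^2 d\,W_k,$$

which follows from Cauchy's inequality and the definition of clipping. Define $K =
\max_t |d_k^t| \le 2\sqrt{d}\gamma_k\Delta_k$. To prove the inequality in Lemma 1, we follow the idea of the peeling
process [IS]. Since $W_k \le 4R^2T_1$, we have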



Pr(Dk> 2jkVWkdT + V2KT/3 



= Pr Dk > 27fc ^/WkdT + V2/v r/3, Wk < m^T^ 



= Pr i^Dk > 27fc ^/WkdT + V2AV/3, 1]| < jldWk , Wk < 4R^Ti 
< Pr (^Dk > 2jkVWkdT + V2Kt/3, I]| < -/ldWk,Wk < epriorTi/(2/3Ai) 

+ ^ Pr fok > 2-fkVWkdT + V2Kt/3, E^, < ^IdWk. ''p™''^! 



i=l 



2(3fx 



< Wk < 



^ f^prior-'- 1 



< Pr Wk < 



< Pr r < 



^prior^l 

2/3/i 

where s is given by 



logs 



"^pnor 



The last step follows from the Bernstein inequality for martingales. We complete the proof by
setting $\tau = \ln(s/\delta)$ and using the fact

$$2\gamma_k\sqrt{W_k d\tau} \le \frac{1}{L}W_k + \gamma_k^2 L d\tau.$$

□
B Proof of Lemma 2

To bound $E_k$, we need the following two lemmas. The first lemma bounds the deviation of
the expected value of a clipped random variable from the original variable, in terms of its
variance (Lemma A.2 from [13]).

Lemma 4. Let $X$ be a random variable, let $\bar{X} = \mathrm{clip}(X, C)$ and assume that $|\mathbb{E}[X]| \le C/2$
for some $C > 0$. Then

$$|\mathbb{E}[\bar{X}] - \mathbb{E}[X]| \le \frac{2}{C}|\mathrm{Var}[X]|.$$
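
The clipping-bias bound of Lemma 4 can be checked exactly on a small discrete distribution. The sketch below assumes the bound has the form $\frac{2}{C}\mathrm{Var}[X]$. The three-point distribution and the threshold `C` are arbitrary choices for illustration.

```python
def clip_scalar(x, c):
    """Symmetric clipping of a scalar to the interval [-c, c]."""
    return max(-c, min(c, x))

def mean(values, probs):
    """Expectation of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))

values, probs, C = [-1.0, 0.0, 10.0], [0.45, 0.5, 0.05], 2.0
ex = mean(values, probs)                                  # E[X]
var = mean([v * v for v in values], probs) - ex ** 2      # Var[X]
ex_clipped = mean([clip_scalar(v, C) for v in values], probs)
bias = abs(ex - ex_clipped)       # |E[clip(X, C)] - E[X]|
bound = 2.0 / C * var             # assumed form of the Lemma 4 bound
```

Here the rare large value 10 is clipped to 2, which shifts the mean, but the shift stays well below the variance-based bound because $|\mathbb{E}[X]| \le C/2$ holds.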

Another key observation used for bounding $E_k$ is the fact that for any non-negative $\beta$-smooth
convex function, we have the following self-bounding property. We note that this
self-bounding property has been used in [27] to get better (optimistic) rates of convergence
for non-negative smooth losses.

Lemma 5. For any $\beta$-smooth non-negative function $f : \mathbb{R} \to \mathbb{R}$, we have $|f'(w)| \le \sqrt{4\beta f(w)}$.

As a simple proof, first from the smoothness assumption, by setting $w_1 = w_2 - \frac{1}{2\beta} f'(w_2)$
in the smoothness inequality and rearranging the terms we obtain $f(w_2) - f(w_1) \ge \frac{1}{4\beta} |f'(w_2)|^2$. On the other hand,
from the convexity of the loss function we have $f(w_1) \ge f(w_2) + \langle f'(w_2), w_1 - w_2 \rangle$. Combining
these inequalities and considering the fact that the function is non-negative gives the desired
inequality.
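The self-bounding property of Lemma 5 can be checked numerically on a concrete loss. The sketch below uses the logistic loss $f(w) = \log(1 + e^{-w})$ as an assumed example (it is not singled out in the text); this loss is non-negative, and its second derivative is at most $1/4$, so it is taken to be $\beta$-smooth with $\beta = 1/4$ under the usual second-derivative definition of smoothness. The function names are hypothetical.

```python
import math

BETA = 0.25  # assumed smoothness constant: the logistic loss has f''(w) <= 1/4


def logistic_loss(w):
    """f(w) = log(1 + exp(-w)), a non-negative smooth convex loss."""
    return math.log1p(math.exp(-w))


def logistic_grad(w):
    """f'(w) = -1 / (1 + exp(w))."""
    return -1.0 / (1.0 + math.exp(w))


def self_bounding_holds(w, beta=BETA):
    """Check |f'(w)| <= sqrt(4 * beta * f(w)) at a single point w."""
    return abs(logistic_grad(w)) <= math.sqrt(4.0 * beta * logistic_loss(w))


# Evaluate the inequality on a grid of points in [-20, 20].
grid = [x / 10.0 for x in range(-200, 201)]
all_ok = all(self_bounding_holds(w) for w in grid)
print("self-bounding property holds on the grid:", all_ok)
```

A grid check of this kind is only a sanity test for one loss; the lemma itself covers every non-negative $\beta$-smooth function.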

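Proof of Lemma 2. To apply the above lemmas, we write $\epsilon_k^t$ as

\[ \epsilon_k^t = \sum_{i=1}^{d} \mathbb{E}_t\left[ \ell'(\langle w_k^t, x_k^t \rangle, y_k^t) [x_k^t]_i - \mathrm{clip}\left( \gamma_k, \ell'(\langle w_k^t, x_k^t \rangle, y_k^t) [x_k^t]_i \right) \right] [w_k^t - w_*]_i. \]

In order to apply Lemma 4, we check if the following condition holds:

\[ \gamma_k \ge 2 \left| \mathbb{E}_t\left[ \ell'(\langle w_k^t, x_k^t \rangle, y_k^t) [x_k^t]_i \right] \right| \qquad (16) \]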

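Since

\[ \left| \mathbb{E}_t\left[ \ell'(\langle w_k^t, x_k^t \rangle, y_k^t) [x_k^t]_i \right] \right| \]

\[ \le \left| \mathbb{E}_t\left[ \left\{ \ell'(\langle w_k^t, x_k^t \rangle, y_k^t) - \ell'(\langle w_*, x_k^t \rangle, y_k^t) \right\} [x_k^t]_i \right] \right| + \left| \mathbb{E}_t\left[ \ell'(\langle w_*, x_k^t \rangle, y_k^t) [x_k^t]_i \right] \right| \]

\[ \le \beta \| w_k^t - w_* \| \le \beta \Delta_k, \]

where the last inequality follows from $\mathbb{E}_t\left[ \ell'(\langle w_*, x_k^t \rangle, y_k^t) [x_k^t]_i \right] = 0$ since $w_*$ is the minimizer
of $\mathcal{L}(w)$, we thus have

\[ \gamma_k = 2\xi\beta\Delta_k \ge 2\beta\Delta_k \ge 2 \left| \mathbb{E}_t\left[ \ell'(\langle w_k^t, x_k^t \rangle, y_k^t) [x_k^t]_i \right] \right|, \]

where $\xi \ge 1$, implying that the condition in (16) holds. Thus, using Lemma 4 we have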

d ^ 

(^'((wl.,x*),y,))' 




< 



-E, 
Using Lemma 5 we further simplify the above bound for $\epsilon_k^t$ as

Tfc 

Ik 

where the third inequality follows from $\|w_k^t - w_*\|_\infty \le \|w_k^t - w_*\| \le \Delta_k$ and Lemma 5, and
therefore

t=i t=i ^ t=i ^ t=i 

< ^/:(w,) + ^^||wl,-w,|p 

where the second inequality follows from the smoothness assumption of $\mathcal{L}(w)$. □

References 

[1] M. Anthony and P.L. Bartlett. Neural network learning: Theoretical foundations.
Cambridge University Press, 1999.

[2] Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample
complexity of active learning. Machine Learning, 80(2-3):111-139, 2010.

[3] P.L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The
Annals of Statistics, 33(4):1497-1537, 2005.

[4] Shai Ben-David, David Pal, and Shai Shalev-Shwartz. Agnostic online learning. In
COLT, 2009.

[5] Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability
and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929-965, 1989.

[6] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games. Cambridge
University Press, 2006.

[7] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch
algorithms via accelerated gradient methods. In NIPS, pages 1647-1655, 2011.




[8] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing 
for stochastic optimization. SIAM Journal on Optimization, 22(2):674-701, 2012. 

[9] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the
number of examples needed for learning. Information and Computation, 82(3):247-261,
1989.

[10] Steve Hanneke. Theoretical Foundations of Active Learning. PhD thesis, 2009. 

[11] Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic regret algo- 
rithms for online convex optimization. In COLT, pages 499-513, 2006. 

[12] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal 
algorithm for stochastic strongly-convex optimization. COLT, 2011. 

[13] Elad Hazan and Tomer Koren. Optimal algorithms for ridge and lasso regression with 
partially observed attributes. CoRR, 2011. 

[14] Anatoli Iouditski and Yuri Nesterov. Primal-dual subgradient methods for
minimizing uniformly convex functions. Available at
http://hal.archives-ouvertes.fr/docs/00/50/89/33/PDF/Strong-hal.pdf, 2010.

[15] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear 
prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793-800, 
2008. 

[16] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse 
Recovery Problems. Lecture Notes in mathematics. Springer, 2011. 

[17] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity 
in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974- 
1980, 1998. 

[18] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation
approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609,
2009.

[19] A.S. Nemirovsky and D.B. Yudin. Problem complexity and method efficiency in opti- 
mization. Wiley Interscience Series in Discrete Mathematics, 1983. 

[20] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer 
Academic Publishers, 2004. 

[21] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent 
optimal for strongly convex stochastic optimization. In ICML, 2012. 

[22] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random 
averages, combinatorial parameters, and learnability. CoRR, abs/1006.1138, 2010. 

[23] Aaditya Ramdas and Aarti Singh. Optimal stochastic convex optimization through the 
lens of active learning. CoRR, abs/1207.3012, 2012. 




[24] S. Shalev-Shwartz, O. Shamir, K. Sridharan, and N. Srebro. Learnability and stability 
in the general learning setting. COLT, 2009. 

[25] S. Shalev-Shwartz, O. Shamir, K. Sridharan, and N. Srebro. Stochastic convex optimiza- 
tion. COLT, 2009. 

[26] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability,
stability and uniform convergence. Journal of Machine Learning Research, 11:2635-2670,
2010.

[27] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast 
rates. In NIPS, pages 2199-2207, 2010. 

[28] Karthik Sridharan. Learning from an optimization viewpoint. PhD Thesis, 2012. 

[29] Karthik Sridharan, Shai Shalev-Shwartz, and Nathan Srebro. Fast rates for regularized 
objectives. In NIPS, pages 1545-1552, 2008. 

[30] V.N. Vapnik and A.Y. Chervonenkis. On the uniform convergence of relative frequencies 
of events to their probabilities. Theory of Probability and Its Applications, 16(2):264-280, 
1971. 






